[RDF] Add internal GetDatasetGlobalClusterBoundaries utility to retrieve cluster entry ranges #21768
@vepadulano the multithreaded approach may not be so straightforward: the global offset depends on the cumulative entry count from previous files, which are opened sequentially. We could maybe think of a two-step approach: collect the clusters in parallel, then sequentially adjust them with a global offset, preserving the original order of the files.
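The two-step idea could look roughly like the sketch below. It is only an illustration using the standard library: `ReadLocalClusters`, `FileClusters` and `GlobalClusterBoundaries` are hypothetical names, a "file" is modelled as a list of cluster sizes instead of a real TTree or RNTuple, and `std::async` stands in for whatever executor the PR actually uses.

```cpp
#include <cstdint>
#include <future>
#include <utility>
#include <vector>

// Half-open entry ranges [begin, end), one per cluster.
using ClusterRanges = std::vector<std::pair<std::uint64_t, std::uint64_t>>;

struct FileClusters {
   ClusterRanges fRanges;      // ranges local to the file, starting at 0
   std::uint64_t fNEntries{0}; // total number of entries in the file
};

// Hypothetical stand-in for opening one file and reading its cluster layout.
// Assumption: the file is described only by the sizes of its clusters.
FileClusters ReadLocalClusters(const std::vector<std::uint64_t> &clusterSizes)
{
   FileClusters result;
   for (auto size : clusterSizes) {
      result.fRanges.emplace_back(result.fNEntries, result.fNEntries + size);
      result.fNEntries += size;
   }
   return result;
}

ClusterRanges GlobalClusterBoundaries(const std::vector<std::vector<std::uint64_t>> &files)
{
   // Step 1: collect the local clusters in parallel, one task per file.
   std::vector<std::future<FileClusters>> futures;
   futures.reserve(files.size());
   for (const auto &file : files)
      futures.push_back(std::async(std::launch::async, [&file] { return ReadLocalClusters(file); }));

   // Step 2: visit the results in the original file order and shift each
   // local range by the number of entries seen in the previous files.
   ClusterRanges global;
   std::uint64_t offset = 0;
   for (auto &fut : futures) {
      const FileClusters local = fut.get();
      for (const auto &range : local.fRanges)
         global.emplace_back(range.first + offset, range.second + offset);
      offset += local.fNEntries;
   }
   return global;
}
```

Because the futures are stored and consumed in the same order as the input files, the file order is preserved no matter which task finishes first; only the cheap offset adjustment runs sequentially.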
vepadulano
left a comment
I think a better function name would be GetClusterBoundaries. More generally, I think we should take this chance to merge the functionalities of ROOT::Internal::TreeUtils::GetClustersAndEntries and ROOT::Internal::RDF::GetClustersAndEntries.
All in all, let's discuss how much extra work it would be to implement GetClustersAndEntries as one function in the Internal::RDF namespace that dispatches the retrieval of the cluster boundaries and the number of entries depending on whether the input dataset is a TTree or an RNTuple.
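To make the discussion concrete, a single dispatching entry point might be shaped like the following sketch. Everything here is an assumption for illustration: `TreeDataset`, `NTupleDataset`, `Retrieve` and the `ClustersAndEntries` struct are hypothetical stand-ins rather than ROOT types, and `std::variant` is just one possible way to express the dispatch.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <variant>
#include <vector>

struct ClustersAndEntries {
   std::vector<std::pair<std::uint64_t, std::uint64_t>> fClusters; // [begin, end)
   std::uint64_t fNEntries{0};
};

// Hypothetical stand-ins for the two dataset kinds (not ROOT classes).
struct TreeDataset {
   std::vector<std::uint64_t> fClusterStarts; // first entry of each cluster
   std::uint64_t fNEntries{0};
};
struct NTupleDataset {
   std::vector<std::uint64_t> fClusterSizes; // entries per cluster
};

// TTree-like retrieval: each cluster ends where the next one starts.
ClustersAndEntries Retrieve(const TreeDataset &tree)
{
   ClustersAndEntries out;
   out.fNEntries = tree.fNEntries;
   for (std::size_t i = 0; i < tree.fClusterStarts.size(); ++i) {
      const auto end = (i + 1 < tree.fClusterStarts.size()) ? tree.fClusterStarts[i + 1] : tree.fNEntries;
      out.fClusters.emplace_back(tree.fClusterStarts[i], end);
   }
   return out;
}

// RNTuple-like retrieval: accumulate the cluster sizes.
ClustersAndEntries Retrieve(const NTupleDataset &ntuple)
{
   ClustersAndEntries out;
   for (auto size : ntuple.fClusterSizes) {
      out.fClusters.emplace_back(out.fNEntries, out.fNEntries + size);
      out.fNEntries += size;
   }
   return out;
}

using Dataset = std::variant<TreeDataset, NTupleDataset>;

// One entry point that dispatches on the kind of dataset it receives.
ClustersAndEntries GetClustersAndEntries(const Dataset &dataset)
{
   return std::visit([](const auto &ds) { return Retrieve(ds); }, dataset);
}
```

Callers would then only see `GetClustersAndEntries`, while the format-specific logic stays in the two overloads.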
@vepadulano as a followup to the conversation we had earlier, I rewrote |
vepadulano
left a comment
Very nice! I left some comments to refine the PR.
This pull request:
Adds GetDatasetGlobalClusterBoundaries as an internal utility to retrieve entry ranges for each cluster in a TTree- or RNTuple-based RDataFrame.
When possible, the files are processed in parallel with a ROOT::TThreadExecutor.
It returns a list of cluster boundaries across files, using a global offset.
This utility is required by the RDataLoader to shuffle and prefetch data for ML training.

Changes
RNTupleDS: add GetDatasetGlobalClusterBoundaries as a friend function to access the private members fNTupleName and fFileNames
RNTupleDS: set fNTupleName in the single-file constructor (like the multi-file constructor)
RInterface: add GetDatasetGlobalClusterBoundaries() implementation for both TTree and RNTuple data sources (dispatches to the existing GetClustersAndEntries implementations)
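
The friend-function change and the single-file constructor fix can be pictured with the minimal sketch below. It is not the RNTupleDS code: `NTupleSourceSketch` and `DescribeInputs` are hypothetical names, and only the member names fNTupleName and fFileNames are taken from the description above.

```cpp
#include <string>
#include <utility>
#include <vector>

class NTupleSourceSketch {
   std::string fNTupleName;
   std::vector<std::string> fFileNames;

   // The free function is granted access to the private members above.
   friend std::vector<std::string> DescribeInputs(const NTupleSourceSketch &source);

public:
   // Single-file constructor: records the ntuple name too, like the multi-file one.
   NTupleSourceSketch(std::string ntupleName, std::string fileName)
      : fNTupleName(std::move(ntupleName)), fFileNames{std::move(fileName)}
   {
   }

   // Multi-file constructor.
   NTupleSourceSketch(std::string ntupleName, std::vector<std::string> fileNames)
      : fNTupleName(std::move(ntupleName)), fFileNames(std::move(fileNames))
   {
   }
};

// Hypothetical utility: lists every "file:ntuple" pair it would need to open.
std::vector<std::string> DescribeInputs(const NTupleSourceSketch &source)
{
   std::vector<std::string> inputs;
   for (const auto &file : source.fFileNames)
      inputs.push_back(file + ":" + source.fNTupleName);
   return inputs;
}
```

If the single-file constructor left fNTupleName empty, a utility like this would produce usable names only for sources built from several files, which is presumably why the description calls out that fix.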